Feat: Don't retry function upload on 400 or 422 status #478

Merged
jenae-janzen merged 21 commits into master from erikw/lambda-error-msg on Sep 22, 2023

Conversation

@erikw (Contributor) commented Jul 20, 2023

# WIP

Don't retry function uploads that fail with a 4xx response.

When a function upload receives a 400 or 422 status code, we know that retrying the upload will fail. This change returns early rather than retrying.
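Roughly, the intended rule looks like the following (a minimal sketch only; the helper name is illustrative, the actual change lives in go/porcelain/deploy.go):

```go
package main

import "fmt"

// isPermanentUploadFailure sketches the rule described above: a 400 or 422
// response means retrying the upload cannot succeed, so we should bail out.
// (Helper name is illustrative, not part of the actual patch.)
func isPermanentUploadFailure(status int) bool {
	return status == 400 || status == 422
}

func main() {
	for _, code := range []int{400, 422, 429, 500} {
		fmt.Printf("status %d -> retry: %v\n", code, !isPermanentUploadFailure(code))
	}
}
```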

@netlify netlify bot commented Jul 20, 2023

Deploy Preview for open-api ready!

🔨 Latest commit: 25f5b63
🔍 Latest deploy log: https://app.netlify.com/sites/open-api/deploys/650ad802998da00008fed3bd
😎 Deploy Preview: https://deploy-preview-478--open-api.netlify.app

go/porcelain/deploy.go (resolved, outdated)
@JGAntunes (Contributor) left a comment

The direction looks right 👍

Something I would suggest we add is a property (a boolean flag of some sort) that we can use to enable this new behaviour, so we can feature-flag it on Buildbot's side.

}

// TODO this changes retry behaviour for the fileUpload case as well. OK?
if apiErr.Code()/100 == 4 {

I would voice the same concern that @biruwon pointed out here. I'm concerned that skipping the retry for all 4xx calls might be a stretch (i.e. shouldn't we still retry 429s?).


Updated to just skip the retry for 400 and 422 status codes 🙂

@jenae-janzen (Contributor) commented:

> Something I would suggest we add is a property (a boolean flag of some sort) that we can use to enable this new behaviour, so we can feature-flag it on Buildbot's side.

I added this to the options struct, but am also open to passing it in as a parameter. I started adding it in Buildbot here, but it's not set up quite right yet and I haven't set up the flag in DevCycle yet.
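For illustration, a minimal sketch of what the flag on the options struct could look like (struct and field names here are illustrative, not the final API):

```go
package porcelain // sketch only

// DeployOptions (illustrative) carries the new flag so Buildbot can toggle
// the behaviour via a feature flag on its side.
type DeployOptions struct {
	// SkipRetry, when true, stops retrying function uploads that fail with
	// a 400 or 422 response. It defaults to false, preserving the current
	// retry behaviour when the feature flag is unset.
	SkipRetry bool
}
```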

@jenae-janzen jenae-janzen marked this pull request as ready for review August 24, 2023 18:35
@jenae-janzen jenae-janzen requested a review from a team as a code owner August 24, 2023 18:35
@jenae-janzen jenae-janzen changed the title from "Don't retry function upload on 4xx" to "Feat: Don't retry function upload on 400 or 422 status" Aug 24, 2023
@jenae-janzen (Contributor) commented:

This is ready for review. We set skipRetry to false by default if no value is received for the feature flag. The accompanying change is also ready for review, but depends on this change going out first because this is where options is defined.

go/porcelain/deploy.go (resolved, outdated)
go/porcelain/deploy.go (resolved)
@JGAntunes (Contributor) left a comment

Left some comments, let me know what you all think 👍

sharedErr.mutex.Lock()
sharedErr.err = operationError
sharedErr.mutex.Unlock()
return nil

So I had to spend some time looking through this, but I believe there's some room for improvement here (in the retry strategy, I mean 😅).

I was initially going to say that we didn't want to return nil here, but then I understood that the idea is to actually exit the retry early. We have a check at the beginning of the retry method where we error out if sharedErr is set:

    if sharedErr.err != nil {
        sharedErr.mutex.Unlock()
        return fmt.Errorf("aborting upload of file %s due to failed upload of another file", f.Name)
    }

However, this case is also not ideal, because it just means we'll keep retrying and erroring out here until we exhaust the count.

I was reading through the docs for backoff - https://pkg.go.dev/github.com/cenkalti/backoff/v4#Retry - and it seems like they have support for a PermanentError - https://pkg.go.dev/github.com/cenkalti/backoff/v4#PermanentError - which is a way for us to exit the retry loop while keeping the "error semantics" correct. It seems like this would be the perfect fit for this case? What do you think?
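For reference, a minimal runnable sketch of how returning a PermanentError exits the retry loop (assuming github.com/cenkalti/backoff/v4; the upload function and error here are placeholders, not the real porcelain code):

```go
package main

import (
	"errors"
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

// errUnprocessable stands in for a 422 API error in this sketch.
var errUnprocessable = errors.New("422 Unprocessable Entity")

// upload is a placeholder for the real file/function upload call.
func upload() error { return errUnprocessable }

func main() {
	attempts := 0
	operation := func() error {
		attempts++
		err := upload()
		if err == nil {
			return nil
		}
		if errors.Is(err, errUnprocessable) {
			// Wrapping the error with backoff.Permanent makes Retry stop and
			// return immediately instead of scheduling another attempt.
			return backoff.Permanent(err)
		}
		return err // retryable: Retry will call operation again
	}

	err := backoff.Retry(operation, backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 5))
	fmt.Printf("attempts=%d err=%v\n", attempts, err) // attempts=1
}
```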

@@ -287,7 +287,7 @@ func TestUploadFiles_Cancelation(t *testing.T) {
for _, bundle := range files.Files {
d.Required = append(d.Required, bundle.Sum)
}
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute)
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute, false)

Wondering if we could add a test here which exercises the new code paths we introduced (i.e. checking we don't retry 400 and 422 errors when uploading).


@JGAntunes do you have any pointers for the testing? I seem to be a bit stuck here. We are also having trouble running the test suite locally, which is slowing this down a bit 😬


About running the test suite locally: if it seems like nothing is running, it's probably because of the retry loop.

Try running the tests as:

    go test -race github.com/netlify/open-api/v2/go/porcelain --run TestUploadFiles400Errors -v

And you should get some output (hopefully).


Thanks for the tip @4xposed!! We got the tests running and can see the output which has been very helpful :D

sharedErr := &uploadError{err: nil, mutex: &sync.Mutex{}}
permanentErr := &backoff.PermanentError{Err: nil}

It's a bit unclear to me why we instantiate permanentErr here, could you elaborate?

As far as I know, backoff.PermanentError is meant to be returned from within a backoff.Retry() block.


Initially we weren't sure how to use PermanentError, but your other comment pointed us in the right direction I think

}

if skipRetry && (apiErr.Code() == 400 || apiErr.Code() == 422) {
operationError = permanentErr

I might be wrong, but I think we should wrap the existing error in a backoff.PermanentError{}, for which backoff provides the Permanent() function:

Suggested change:
-operationError = permanentErr
+operationError = backoff.Permanent(operationError)


Ah, I think that does make sense. And then when backoff receives the Permanent error, it should stop retrying, if I understand correctly.

ctx := gocontext.Background()

server := httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, _ *http.Request) {
rw.WriteHeader(http.StatusUnprocessableEntity)
@4xposed (Contributor) commented Aug 31, 2023

I think to properly assert the error message in the require.Equal on line 350, we need to make the mock server return a JSON body with code and message keys:

Suggested change:
-rw.WriteHeader(http.StatusUnprocessableEntity)
+rw.WriteHeader(http.StatusUnprocessableEntity)
+rw.Header().Set("Content-Type", "application/json; charset=utf-8")
+rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422}`))


This is super helpful and definitely makes sense! Looking at other test cases, I see this is done similarly. However, when we set rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422}`)) we keep getting the error &{0 } (*models.Error) is not supported by the TextConsumer, can be resolved by supporting TextUnmarshaler interface.

So the apiError isn't set, and our test skips over the code we're trying to test and keeps retrying 🙈

Looking at examples in other tests, it seems like setting the response this way works when calling client.Operations.UploadDeployFile, but not when calling client.uploadFiles -- when calling uploadFiles the error is thrown from within UploadDeployFile.

We've been trying to debug this by writing the body in different ways, trying to marshal the response, etc., but haven't had any success yet 🤔 Is there maybe something silly we're missing?

I think it might be the order in which those are set; the mock server is a bit quirky at times. Try setting them in this order (WriteHeader sends the headers, so anything set on the header map after it is ignored):

rw.Header().Set("Content-Type", "application/json; charset=utf-8")
rw.WriteHeader(http.StatusUnprocessableEntity)
rw.Write([]byte(`{"message": "Unprocessable Entity", "code": 422 }`))


👋 @4xposed thanks so much for the tip! That worked, and as far as I can tell the server is now responding as expected. However, we're hitting a different problem and struggling to debug it.

We now receive the error with the correct code and message, but when we get to this line:

    apiErr, ok := operationError.(apierrors.Error)

no matter what we try, apiErr is always nil and ok is always false.

operationError is an error that looks like:

    error(*github.com/netlify/open-api/v2/go/plumbing/operations.UploadDeployFunctionDefault) *{_statusCode: 422, Payload: *github.com/netlify/open-api/v2/go/models.Error {Code: 422, Message: "Unprocessable Entity"}}

In my debugger, I can access the status code via either operationError._statusCode or operationError.Payload.Code, but not via operationError.(apierrors.Error). However, I can't call operationError.Payload.Code in the actual code, because Payload/Code isn't defined on the Error type.

So, since operationError.(apierrors.Error) never seems to evaluate to anything and there wasn't a test for the original code, we're uncertain whether we're doing something wrong here or whether the original code was actually working as intended.

Are we missing something here? Is there a better way to test this (perhaps non-locally)?

cc @jrwhitmer @JGAntunes
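For what it's worth, a self-contained sketch of the difference between an interface type assertion and errors.As against the concrete type; uploadDefault below is a hypothetical stand-in for the generated operations.UploadDeployFunctionDefault shown in the error dump above:

```go
package main

import (
	"errors"
	"fmt"
)

// uploadDefault is a hypothetical stand-in for the generated
// operations.UploadDeployFunctionDefault from the error dump above.
type uploadDefault struct{ statusCode int }

func (e *uploadDefault) Error() string { return fmt.Sprintf("upload failed with status %d", e.statusCode) }
func (e *uploadDefault) Code() int     { return e.statusCode }

func main() {
	var operationError error = &uploadDefault{statusCode: 422}

	// errors.As targets the concrete type, so it recovers the status code even
	// when an interface assertion like operationError.(apierrors.Error) does
	// not match the type's method set.
	var dflt *uploadDefault
	if errors.As(operationError, &dflt) {
		fmt.Println(dflt.Code()) // 422
	}
}
```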


@4xposed Thanks, I really appreciate the offer! I'm in Europe now 🎉

Testing manually using the debugger, it looks like everything's working as expected; thanks so much for the fix. The one last thing I'm trying to do to wrap up this PR is to fix my tests:

require.Equal(t, err, "[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default  &{Code:422 Message: Unprocessable Entity}")

doesn't work because the error's now wrapped by the backoff.Permanent error. Is it sufficient to just require.Error(t, err) here? Or is there a better way to assert this that I'm missing?


Pairing might not be necessary since most of this seems to be working, but I'd appreciate you taking another look when you have the time 🙂


We could call err.Unwrap(); it seems like backoff.PermanentError has a few methods for this: https://github.com/cenkalti/backoff/blob/v4/retry.go#L129

I don't think it's strictly necessary, but it would be nice to at least have some way to verify that we have the right error and not just any error, if that makes sense.

What do you think?


I'll give it a shot! I do agree it would be nice to have the right error if possible


Okay, I used require.ErrorContains instead and that seemed to do the trick for verifying we have the correct error message.
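A minimal sketch of that assertion (the package name and error string are illustrative; require.ErrorContains matches on a substring, so the backoff.Permanent wrapping doesn't break the check):

```go
package porcelain_test // package name is arbitrary for this sketch

import (
	"errors"
	"fmt"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestErrorContainsSketch(t *testing.T) {
	// Stand-in for the API error after it has been wrapped via backoff.Permanent.
	inner := errors.New("[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default")
	err := fmt.Errorf("permanent: %w", inner)

	// ErrorContains asserts that err is non-nil and that its message contains
	// the given substring, so the extra wrapping does not affect the check.
	require.ErrorContains(t, err, "[422] uploadDeployFile default")
}
```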

// Set SkipRetry to false
err = client.uploadFiles(ctx, d, files, nil, fileUpload, time.Minute, false)
// require.Equal(t, err, "[PUT /deploys/{deploy_id}/files/{path}][422] uploadDeployFile default &{Code:422 Message: Unprocessable Entity}")
require.Equal(t, attempts, 12)

The number of attempts doesn't seem to be consistent, so maybe there's a better way to assert this


@4xposed this is one other tiny detail that sometimes makes the test fail - every once in a while it retries 13 times 🤔


Yes, I think (emphasis on think) it retries indefinitely until the context times out or is cancelled.

I think checking that there's more than 1 attempt (or some arbitrary number that's not too high) might be enough, maybe?

I will try to look more into the retry mechanism tomorrow morning, to figure out whether we can give the test more control over how many times we retry.


I added a check to just make sure that the attempts are greater than 1 :) That should probably be sufficient for what we're testing

@jenae-janzen jenae-janzen merged commit 9b75de7 into master Sep 22, 2023
14 checks passed
@jenae-janzen jenae-janzen deleted the erikw/lambda-error-msg branch September 22, 2023 11:55